DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers are needed to manually screen each submission before it is approved for posting on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals, far more than its volunteers can screen manually.
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
The train.csv data set provided by DonorsChoose contains the following features:
| Feature | Description |
|---|---|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. |
| project_grade_category | Grade level of students for which the project is targeted (one of a small set of enumerated values). |
| project_subject_categories | One or more (comma-separated) subject categories for the project, drawn from an enumerated list of values. |
| school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
| project_resource_summary | An explanation of the resources needed for the project. |
| project_essay_1 | First application essay* |
| project_essay_2 | Second application essay* |
| project_essay_3 | Third application essay* |
| project_essay_4 | Fourth application essay* |
| project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title (one of a small set of enumerated values). |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |

\* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
| Feature | Description |
|---|---|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |
Note: Many projects require multiple resources, so a project can appear on several lines of this file. The id value corresponds to a project_id in train.csv, so you can use it as a key to retrieve all resources needed for a project.
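As a sketch of that lookup, a pandas groupby gives per-project totals that can later be merged back into the training data. The resource rows below are made up for illustration (p039565 and the second and third rows are not from the real file):

```python
import pandas as pd

# Toy stand-in for resources.csv (only the first description is from the real data)
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p039565"],
    "description": ["Tenor Saxophone Reeds, Box of 25", "Music stand", "Chromebook"],
    "quantity": [3, 1, 5],
    "price": [9.95, 25.00, 180.00],
})

# All resources for a single project, looked up by its id
print(resources[resources["id"] == "p036502"])

# Per-project totals, ready to merge back into train.csv on project_id
totals = resources.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()
print(totals)
```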
The data set contains the following label (the value you will attempt to predict):
| Label | Description |
|---|---|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved; a value of 1 indicates it was approved. |
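Because the approved class heavily outnumbers the rejected one in this dataset, plain accuracy is a misleading metric (which is why AUC, via roc_curve/auc, is imported below). A quick label-distribution check looks like this — the values here are toy numbers, not the real counts:

```python
import pandas as pd

# Toy stand-in for the project_is_approved column (the real data is also
# skewed toward approved projects)
labels = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], name="project_is_approved")

print(labels.value_counts())            # counts per class
print("approval rate:", labels.mean())  # fraction of approved projects
```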
%%time
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import os
import re      # tutorial on Python regular expressions: https://pymotw.com/2/re/
import math
import string
import pickle
import sqlite3
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, auc

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# using 40k rows due to memory constraint
project_data = pd.read_csv('train_data.csv',nrows=40000)
resource_data = pd.read_csv('resources.csv')
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)
print("Number of data points in resource data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)
#### project_subject_categories

categories = list(project_data['project_subject_categories'].values)
# remove special characters from a list of strings: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in categories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','):  # split into parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split():  # split each category on spaces: "Math & Science" => ["Math", "&", "Science"]
            j = j.replace('The', '')  # drop the word 'The'
        j = j.replace(' ', '')  # remove all spaces: "Math & Science" => "Math&Science"
        temp += j.strip() + " "  # strip() removes leading/trailing whitespace
    temp = temp.replace('&', '_')  # replace '&' with '_' so each category is a single token
    cat_list.append(temp.strip())

project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
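The sorted category counts built above lend themselves to a quick bar chart. A minimal sketch using made-up counts (the numbers below are illustrative, not from the data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt

# Made-up counts standing in for sorted_cat_dict (ascending order, as built above)
sorted_counts = {"Warmth": 120, "Health_Sports": 3500, "Math_Science": 16000}

plt.figure(figsize=(8, 4))
plt.barh(list(sorted_counts.keys()), list(sorted_counts.values()))
plt.xlabel("Number of projects")
plt.title("Project subject category counts")
plt.tight_layout()
plt.savefig("category_counts.png")
```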
#### project_subject_subcategories

sub_categories = list(project_data['project_subject_subcategories'].values)
# remove special characters from a list of strings: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
sub_cat_list = []
for i in sub_categories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','):  # split into parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split():  # split each subcategory on spaces: "Math & Science" => ["Math", "&", "Science"]
            j = j.replace('The', '')  # drop the word 'The'
        j = j.replace(' ', '')  # remove all spaces: "Math & Science" => "Math&Science"
        temp += j.strip() + " "  # strip() removes leading/trailing whitespace
    temp = temp.replace('&', '_')  # replace '&' with '_' so each subcategory is a single token
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)
# count of all the words in the corpus: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())

sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))
# merge the four essay columns into one text column; a space separator avoids
# gluing the last word of one essay to the first word of the next, and
# fillna('') avoids literal "nan" strings where essays 3 and 4 are empty
project_data["essay"] = (project_data["project_essay_1"].fillna('') + " " +
                         project_data["project_essay_2"].fillna('') + " " +
                         project_data["project_essay_3"].fillna('') + " " +
                         project_data["project_essay_4"].fillna(''))
project_data.head(2)
#### Preprocessing the essay text
# printing some random essays
print(project_data['essay'].values[0])
print("="*50)
print(project_data['essay'].values[150])
print("="*50)
print(project_data['essay'].values[1000])
print("="*50)
print(project_data['essay'].values[20000])
print("="*50)
# expand English contractions: https://stackoverflow.com/a/47091490/4084039
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
sent = decontracted(project_data['essay'].values[20000])
print(sent)
print("="*50)
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)
# https://gist.github.com/sebleier/554280
# note: 'no', 'nor', and 'not' are deliberately left out of this stop-word list,
# since negation can matter for approval
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]
# Combining all the above statements
preprocessed_essays = []
# tqdm prints a progress bar
for sentence in tqdm(project_data['essay'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = sent.lower()  # lowercase first, so capitalized stop words are caught too
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e not in stopwords)
    preprocessed_essays.append(sent.strip())

project_data['preprocessed_essays'] = preprocessed_essays
# after preprocessing
preprocessed_essays[20000]
proj_essay_wrd_count = []
for essay in project_data['preprocessed_essays']:
    proj_essay_wrd_count.append(len(essay.split()))

project_data['proj_essay_wrd_count'] = proj_essay_wrd_count
project_data.head(3)
# similarly, preprocess the titles
# printing some random titles
print(project_data['project_title'].values[0])
print("="*50)
print(project_data['project_title'].values[150])
print("="*50)
print(project_data['project_title'].values[1000])
# Combining all the above statements
preprocessed_titles = []
# tqdm prints a progress bar
for sentence in tqdm(project_data['project_title'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = sent.lower()  # lowercase first, so capitalized stop words are caught too
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e not in stopwords)
    preprocessed_titles.append(sent.strip())

project_data['preprocessed_titles'] = preprocessed_titles
project_data.head(3)
proj_title_wrd_count = []
for title in project_data['preprocessed_titles']:
    proj_title_wrd_count.append(len(title.split()))

project_data['proj_title_wrd_count'] = proj_title_wrd_count
project_data.head(3)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # required once before first use

sid = SentimentIntensityAnalyzer()
neg, pos, neu, compound = [], [], [], []
for for_sentiment in tqdm(project_data['preprocessed_essays']):
    scores = sid.polarity_scores(for_sentiment)  # compute once per essay
    neg.append(scores['neg'])            # negative sentiment score
    pos.append(scores['pos'])            # positive sentiment score
    neu.append(scores['neu'])            # neutral sentiment score
    compound.append(scores['compound'])  # compound sentiment score

# Creating new features
project_data['Essay_neg_ss'] = neg
project_data['Essay_pos_ss'] = pos
project_data['Essay_neu_ss'] = neu
project_data['Essay_compound_ss'] = compound
project_data.head(3)
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace(" ", "_")
project_data['project_grade_category'].value_counts()
project_data['teacher_prefix'] = project_data['teacher_prefix'].str.replace(".", "", regex=False)  # '.' is a regex wildcard by default, which would strip every character
project_data['teacher_prefix'].value_counts()
project_data.columns
We are going to consider the following features:
- school_state : categorical data
- clean_categories : categorical data
- clean_subcategories : categorical data
- project_grade_category : categorical data
- teacher_prefix : categorical data
- project_title : text data
- essay : text data
- project_resource_summary : text data (optional)
- quantity : numerical (optional)
- teacher_number_of_previously_posted_projects : numerical
- price : numerical
Y = project_data['project_is_approved'].values
project_data.drop(['project_is_approved'], axis=1, inplace=True)
X = project_data
X.head(1)
# train / cv / test split, stratified on the label to preserve class balance;
# random_state is fixed so the split is reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, stratify=Y, random_state=42)
X_train, X_cv, Y_train, Y_cv = train_test_split(X_train, Y_train, test_size=0.33, stratify=Y_train, random_state=42)
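Stratifying on Y keeps the approved/rejected proportions essentially identical across the splits, which matters for an imbalanced label like this one. A self-contained sketch on synthetic data (the 80/20 class ratio here is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: roughly 80% positive (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
Y = (rng.rand(1000) < 0.8).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, test_size=0.33, stratify=Y, random_state=42)

# The positive-class fraction is preserved (up to rounding) in both splits
print(round(Y_tr.mean(), 3), round(Y_te.mean(), 3))
```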
# we use count vectorizer to convert the values into one hot encoded features
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_categories = CountVectorizer(vocabulary=list(sorted_cat_dict.keys()), lowercase=False, binary=True)
categories_one_hot_train = vectorizer_categories.fit_transform(X_train['clean_categories'].values)
categories_one_hot_test = vectorizer_categories.transform(X_test['clean_categories'].values)
categories_one_hot_cv = vectorizer_categories.transform(X_cv['clean_categories'].values)
print("After vectorizations")
print("Shape of Train data - one hot encoding ",categories_one_hot_train.shape)
print("Shape of Test data - one hot encoding ",categories_one_hot_test.shape)
print("Shape of CV data - one hot encoding ",categories_one_hot_cv.shape)
print("="*100)
print(vectorizer_categories.get_feature_names())
print("="*100)
# we use count vectorizer to convert the values into one hot encoded features
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
vectorizer_sub_cat = CountVectorizer(vocabulary=list(sorted_sub_cat_dict.keys()), lowercase=False, binary=True)
sub_cat_one_hot_train = vectorizer_sub_cat.fit_transform(X_train['clean_subcategories'].values)
sub_cat_one_hot_test = vectorizer_sub_cat.transform(X_test['clean_subcategories'].values)
sub_cat_one_hot_cv = vectorizer_sub_cat.transform(X_cv['clean_subcategories'].values)
print("After vectorizations")
print("Shape of Train data - one hot encoding ",sub_cat_one_hot_train.shape)
print("Shape of Test data - one hot encoding",sub_cat_one_hot_test.shape)
print("Shape of CV data - one hot encoding",sub_cat_one_hot_cv.shape)
print("="*100)
print(vectorizer_sub_cat.get_feature_names())
print("="*100)
# you can do the similar thing with state, teacher_prefix and project_grade_category also
my_counter = Counter()
for state in project_data['school_state'].values:
    my_counter.update(state.split())

school_state_cat_dict = dict(my_counter)
sorted_school_state_cat_dict = dict(sorted(school_state_cat_dict.items(), key=lambda kv: kv[1]))
## we use count vectorizer to convert the values into one hot encoded features
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
vectorizer_school_state = CountVectorizer(vocabulary=list(sorted_school_state_cat_dict.keys()), lowercase=False, binary=True)
school_state_one_hot_train = vectorizer_school_state.fit_transform(X_train['school_state'].values)
school_state_one_hot_test = vectorizer_school_state.transform(X_test['school_state'].values)
school_state_one_hot_cv = vectorizer_school_state.transform(X_cv['school_state'].values)
print("After vectorizations")
print("Shape of Train data - one hot encoding",school_state_one_hot_train.shape)
print("Shape of Test data - one hot encoding",school_state_one_hot_test.shape)
print("Shape of CV data - one hot encoding",school_state_one_hot_cv.shape)
print("="*100)
print(vectorizer_school_state.get_feature_names())
print("="*100)
my_counter = Counter()
for project_grade in project_data['project_grade_category'].values:
    my_counter.update(project_grade.split())

project_grade_cat_dict = dict(my_counter)
sorted_project_grade_cat_dict = dict(sorted(project_grade_cat_dict.items(), key=lambda kv: kv[1]))
## we use count vectorizer to convert the values into one hot encoded features
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
vectorizer_project_grade_cat = CountVectorizer(vocabulary=list(sorted_project_grade_cat_dict.keys()), lowercase=False, binary=True)
project_grade_cat_one_hot_train = vectorizer_project_grade_cat.fit_transform(X_train['project_grade_category'].values)
project_grade_cat_one_hot_test = vectorizer_project_grade_cat.transform(X_test['project_grade_category'].values)
project_grade_cat_one_hot_cv = vectorizer_project_grade_cat.transform(X_cv['project_grade_category'].values)
print("After vectorizations")
print("="*100)
print("Shape of Train data - one hot encoding",project_grade_cat_one_hot_train.shape)
print("Shape of Test data - one hot encoding",project_grade_cat_one_hot_test.shape)
print("Shape of CV data - one hot encoding",project_grade_cat_one_hot_cv.shape)
print("="*100)
print(vectorizer_project_grade_cat.get_feature_names())
my_counter = Counter()
for teacher_prefix in project_data['teacher_prefix'].values:
    teacher_prefix = str(teacher_prefix)  # guard against NaN values
    my_counter.update(teacher_prefix.split())

teacher_prefix_cat_dict = dict(my_counter)
sorted_teacher_prefix_cat_dict = dict(sorted(teacher_prefix_cat_dict.items(), key=lambda kv: kv[1]))
vectorizer_teacher_prefix_cat = CountVectorizer(vocabulary=list(sorted_teacher_prefix_cat_dict.keys()), lowercase=False, binary=True)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
teacher_prefix_cat_one_hot_train = vectorizer_teacher_prefix_cat.fit_transform(X_train['teacher_prefix'].values.astype("U"))
teacher_prefix_cat_one_hot_test = vectorizer_teacher_prefix_cat.transform(X_test['teacher_prefix'].values.astype("U"))
teacher_prefix_cat_one_hot_cv = vectorizer_teacher_prefix_cat.transform(X_cv['teacher_prefix'].values.astype("U"))
print("After vectorizations")
print("="*100)
print("Shape of Train data - one hot encoding",teacher_prefix_cat_one_hot_train.shape)
print("Shape of Test data - one hot encoding ",teacher_prefix_cat_one_hot_test.shape)
print("Shape of CV data - one hot encoding ",teacher_prefix_cat_one_hot_cv.shape)
print("="*100)
print(vectorizer_teacher_prefix_cat.get_feature_names())
# Bag of words on essays: bigrams (ngram_range=(2, 2)) that appear in at least
# 10 documents (min_df=10), capped at 5000 features
vectorizer_essay_bow = CountVectorizer(ngram_range=(2, 2), min_df=10, max_features=5000)
# BOW for essays, Train data (fit on train only to avoid leakage)
essay_bow_train = vectorizer_essay_bow.fit_transform(X_train['preprocessed_essays'])
print("Shape of matrix for TRAIN data ",essay_bow_train.shape)
# BOW for essays Test Data
essay_bow_test = vectorizer_essay_bow.transform(X_test['preprocessed_essays'])
print("Shape of matrix for TEST data",essay_bow_test.shape)
# BOW for essays CV Data
essay_bow_cv = vectorizer_essay_bow.transform(X_cv['preprocessed_essays'])
print("Shape of matrix for CV data ",essay_bow_cv.shape)
vectorizer_title_bow = CountVectorizer(ngram_range=(2, 2), min_df=10, max_features=5000)
# BOW for titles, Train data
title_bow_train = vectorizer_title_bow.fit_transform(X_train['preprocessed_titles'])
print("Shape of matrix for TRAIN data ",title_bow_train.shape)
# BOW for title Test Data
title_bow_test = vectorizer_title_bow.transform(X_test['preprocessed_titles'])
print("Shape of matrix for TEST data",title_bow_test.shape)
# BOW for title CV Data
title_bow_cv = vectorizer_title_bow.transform(X_cv['preprocessed_titles'])
print("Shape of matrix for CV data ",title_bow_cv.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_essay_tfidf = TfidfVectorizer(ngram_range=(2, 2), min_df=10, max_features=5000)
# TFIDF for essays, Train data
essay_tfidf_train = vectorizer_essay_tfidf.fit_transform(X_train['preprocessed_essays'])
print("Shape of matrix for TRAIN data", essay_tfidf_train.shape)
# TFIDF for essays, Test data
essay_tfidf_test = vectorizer_essay_tfidf.transform(X_test['preprocessed_essays'])
print("Shape of matrix for TEST data", essay_tfidf_test.shape)
# TFIDF for essays, CV data
essay_tfidf_cv = vectorizer_essay_tfidf.transform(X_cv['preprocessed_essays'])
print("Shape of matrix for CV data", essay_tfidf_cv.shape)
vectorizer_title_tfidf = TfidfVectorizer(ngram_range=(2, 2), min_df=10, max_features=5000)
# TFIDF for titles, Train data
title_tfidf_train = vectorizer_title_tfidf.fit_transform(X_train['preprocessed_titles'])
print("Shape of matrix for TRAIN data", title_tfidf_train.shape)
# TFIDF for titles, Test data
title_tfidf_test = vectorizer_title_tfidf.transform(X_test['preprocessed_titles'])
print("Shape of matrix for TEST data", title_tfidf_test.shape)
# TFIDF for titles, CV data
title_tfidf_cv = vectorizer_title_tfidf.transform(X_cv['preprocessed_titles'])
print("Shape of matrix for CV data", title_tfidf_cv.shape)
# storing/loading variables with pickle: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
glove_words = set(model.keys())
# average Word2Vec: compute the average word vector for each essay/title
def avg_w2v_vectors_func(sentence):
    vector = np.zeros(300)  # the GloVe vectors used here are 300-dimensional
    cnt_words = 0  # number of words with a valid vector in the sentence
    for word in sentence.split():
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    return vector
essay_avg_w2v_train = []
essay_avg_w2v_test = []
essay_avg_w2v_cv = []

# Avg-w2v for Train data
for sentence in tqdm(X_train['preprocessed_essays']):
    essay_avg_w2v_train.append(avg_w2v_vectors_func(sentence))
print("len(essay_avg_w2v_train):", len(essay_avg_w2v_train))
print("len(essay_avg_w2v_train[0]):", len(essay_avg_w2v_train[0]))

# Avg-w2v for Test data
for sentence in tqdm(X_test['preprocessed_essays']):
    essay_avg_w2v_test.append(avg_w2v_vectors_func(sentence))
print("len(essay_avg_w2v_test):", len(essay_avg_w2v_test))
print("len(essay_avg_w2v_test[0]):", len(essay_avg_w2v_test[0]))

# Avg-w2v for CV data
for sentence in tqdm(X_cv['preprocessed_essays']):
    essay_avg_w2v_cv.append(avg_w2v_vectors_func(sentence))
print("len(essay_avg_w2v_cv):", len(essay_avg_w2v_cv))
print("len(essay_avg_w2v_cv[0]):", len(essay_avg_w2v_cv[0]))
title_avg_w2v_train = []
title_avg_w2v_test = []

# Avg-w2v for Train data
for sentence in tqdm(X_train['preprocessed_titles']):
    title_avg_w2v_train.append(avg_w2v_vectors_func(sentence))
print("len(title_avg_w2v_train):", len(title_avg_w2v_train))
print("len(title_avg_w2v_train[0]):", len(title_avg_w2v_train[0]))

# Avg-w2v for Test data
for sentence in tqdm(X_test['preprocessed_titles']):
    title_avg_w2v_test.append(avg_w2v_vectors_func(sentence))
print("len(title_avg_w2v_test):", len(title_avg_w2v_test))
print("len(title_avg_w2v_test[0]):", len(title_avg_w2v_test[0]))
tfidf_model = TfidfVectorizer()
tfidf_model.fit(X_train['preprocessed_essays'])
# build a dictionary mapping each word to its idf value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())
# Compute the TFIDF-weighted word2vec vector for each sentence
def tf_idf_weight_func(sentence):
    vector = np.zeros(300)  # the GloVe vectors used here are 300-dimensional
    tf_idf_weight = 0  # running sum of the tf-idf weights in the sentence
    for word in sentence.split():
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # the word's vector
            # tf-idf = idf (dictionary[word]) * tf (count / number of words)
            tf_idf = dictionary[word] * (sentence.count(word) / len(sentence.split()))
            vector += (vec * tf_idf)  # accumulate the tfidf-weighted vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    return vector
essay_tfidf_w2v_train = []
essay_tfidf_w2v_test = []
essay_tfidf_w2v_cv = []

# TFIDF weighted W2V for Train data
for sentence in tqdm(X_train['preprocessed_essays']):
    essay_tfidf_w2v_train.append(tf_idf_weight_func(sentence))
print("len(essay_tfidf_w2v_train):", len(essay_tfidf_w2v_train))
print("len(essay_tfidf_w2v_train[0]):", len(essay_tfidf_w2v_train[0]))

# TFIDF weighted W2V for Test data
for sentence in tqdm(X_test['preprocessed_essays']):
    essay_tfidf_w2v_test.append(tf_idf_weight_func(sentence))
print("len(essay_tfidf_w2v_test):", len(essay_tfidf_w2v_test))
print("len(essay_tfidf_w2v_test[0]):", len(essay_tfidf_w2v_test[0]))

# TFIDF weighted W2V for CV data
for sentence in tqdm(X_cv['preprocessed_essays']):
    essay_tfidf_w2v_cv.append(tf_idf_weight_func(sentence))
print("len(essay_tfidf_w2v_cv):", len(essay_tfidf_w2v_cv))
print("len(essay_tfidf_w2v_cv[0]):", len(essay_tfidf_w2v_cv[0]))
# recompute title avg-w2v, this time for train, test, and CV
title_avg_w2v_train = []
title_avg_w2v_test = []
title_avg_w2v_cv = []

# Avg-w2v for Train data
for sentence in tqdm(X_train['preprocessed_titles']):
    title_avg_w2v_train.append(avg_w2v_vectors_func(sentence))
print("len(title_avg_w2v_train):", len(title_avg_w2v_train))
print("len(title_avg_w2v_train[0]):", len(title_avg_w2v_train[0]))

# Avg-w2v for Test data
for sentence in tqdm(X_test['preprocessed_titles']):
    title_avg_w2v_test.append(avg_w2v_vectors_func(sentence))
print("len(title_avg_w2v_test):", len(title_avg_w2v_test))
print("len(title_avg_w2v_test[0]):", len(title_avg_w2v_test[0]))

# Avg-w2v for CV data
for sentence in tqdm(X_cv['preprocessed_titles']):
    title_avg_w2v_cv.append(avg_w2v_vectors_func(sentence))
print("len(title_avg_w2v_cv):", len(title_avg_w2v_cv))
print("len(title_avg_w2v_cv[0]):", len(title_avg_w2v_cv[0]))
title_tfidf_w2v_train = []
title_tfidf_w2v_test = []
title_tfidf_w2v_cv = []

# TFIDF weighted W2V for Train data
for sentence in tqdm(X_train['preprocessed_titles']):
    title_tfidf_w2v_train.append(tf_idf_weight_func(sentence))
print("len(title_tfidf_w2v_train):", len(title_tfidf_w2v_train))
print("len(title_tfidf_w2v_train[0]):", len(title_tfidf_w2v_train[0]))

# TFIDF weighted W2V for Test data
for sentence in tqdm(X_test['preprocessed_titles']):
    title_tfidf_w2v_test.append(tf_idf_weight_func(sentence))
print("len(title_tfidf_w2v_test):", len(title_tfidf_w2v_test))
print("len(title_tfidf_w2v_test[0]):", len(title_tfidf_w2v_test[0]))

# TFIDF weighted W2V for CV data
for sentence in tqdm(X_cv['preprocessed_titles']):
    title_tfidf_w2v_cv.append(tf_idf_weight_func(sentence))
print("len(title_tfidf_w2v_cv):", len(title_tfidf_w2v_cv))
print("len(title_tfidf_w2v_cv[0]):", len(title_tfidf_w2v_cv[0]))
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
X_train = pd.merge(X_train, price_data, on='id', how='left')
X_test = pd.merge(X_test, price_data, on='id', how='left')
X_cv = pd.merge(X_cv, price_data, on='id', how='left')
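A left merge leaves NaN in price and quantity for any project id that has no rows in resources.csv, which would break the numeric scaling below, so it is worth checking for (and filling) such gaps. A toy sketch (the ids p1..p3 are illustrative only):

```python
import pandas as pd

# Toy frames: project p2 has no matching resource rows
projects = pd.DataFrame({"id": ["p1", "p2", "p3"]})
price_data = pd.DataFrame({"id": ["p1", "p3"],
                           "price": [34.95, 180.0],
                           "quantity": [4, 5]})

merged = pd.merge(projects, price_data, on="id", how="left")
print(merged["price"].isna().sum())  # prints 1: one project with no resource rows

# One option: treat missing totals as zero before scaling
merged[["price", "quantity"]] = merged[["price", "quantity"]].fillna(0)
```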
from sklearn.preprocessing import MinMaxScaler
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
# Note: sklearn's Normalizer works row-wise (it scales each sample to unit norm),
# so on a single-feature column it would map every value to 1. MinMaxScaler scales
# the column itself to [0, 1]. reshape(-1, 1) is needed because the scaler expects
# a 2D array of shape (n_samples, n_features).
scaler = MinMaxScaler()
price_data_train = scaler.fit_transform(X_train['price'].values.reshape(-1, 1))
price_data_test = scaler.transform(X_test['price'].values.reshape(-1, 1))
price_data_cv = scaler.transform(X_cv['price'].values.reshape(-1, 1))
print("After scaling")
print("="*100)
print(price_data_train.shape, Y_train.shape)
print(price_data_test.shape, Y_test.shape)
print(price_data_cv.shape, Y_cv.shape)
print("="*100)
from sklearn.preprocessing import MinMaxScaler
# column-wise scaling to [0, 1]; row-wise Normalizer would map a single feature to all 1s
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
scaler = MinMaxScaler()
quant_train = scaler.fit_transform(X_train['quantity'].values.reshape(-1, 1))
quant_cv = scaler.transform(X_cv['quantity'].values.reshape(-1, 1))
quant_test = scaler.transform(X_test['quantity'].values.reshape(-1, 1))
print("="*100)
print("After scaling")
print(quant_train.shape, Y_train.shape)
print(quant_cv.shape, Y_cv.shape)
print(quant_test.shape, Y_test.shape)
print("="*100)
from sklearn.preprocessing import MinMaxScaler
# column-wise scaling to [0, 1]; row-wise Normalizer would map a single feature to all 1s
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
print(X_cv.shape, Y_cv.shape)
print("="*100)
scaler = MinMaxScaler()
prev_no_projects_train = scaler.fit_transform(X_train['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
prev_no_projects_cv = scaler.transform(X_cv['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
prev_no_projects_test = scaler.transform(X_test['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
print("="*100)
print("After scaling")
print(prev_no_projects_train.shape, Y_train.shape)
print(prev_no_projects_cv.shape, Y_cv.shape)
print(prev_no_projects_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['proj_title_wrd_count'].values.reshape(-1,1))
title_cnt_train = normalizer.transform(X_train['proj_title_wrd_count'].values.reshape(-1,1))
title_cnt_test = normalizer.transform(X_test['proj_title_wrd_count'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(title_cnt_train.shape, Y_train.shape)
print(title_cnt_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['proj_essay_wrd_count'].values.reshape(-1,1))
essay_cnt_train = normalizer.transform(X_train['proj_essay_wrd_count'].values.reshape(-1,1))
essay_cnt_test = normalizer.transform(X_test['proj_essay_wrd_count'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(essay_cnt_train.shape, Y_train.shape)
print(essay_cnt_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['Essay_neg_ss'].values.reshape(-1,1))
essay_neg_train = normalizer.transform(X_train['Essay_neg_ss'].values.reshape(-1,1))
essay_neg_test = normalizer.transform(X_test['Essay_neg_ss'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(essay_neg_train.shape, Y_train.shape)
print(essay_neg_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['Essay_pos_ss'].values.reshape(-1,1))
essay_pos_train = normalizer.transform(X_train['Essay_pos_ss'].values.reshape(-1,1))
essay_pos_test = normalizer.transform(X_test['Essay_pos_ss'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(essay_pos_train.shape, Y_train.shape)
print(essay_pos_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['Essay_neu_ss'].values.reshape(-1,1))
essay_neu_train = normalizer.transform(X_train['Essay_neu_ss'].values.reshape(-1,1))
essay_neu_test = normalizer.transform(X_test['Essay_neu_ss'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(essay_neu_train.shape, Y_train.shape)
print(essay_neu_test.shape, Y_test.shape)
print("="*100)
normalizer = Normalizer()
normalizer.fit(X_train['Essay_compound_ss'].values.reshape(-1,1))
essay_compound_train = normalizer.transform(X_train['Essay_compound_ss'].values.reshape(-1,1))
essay_compound_test = normalizer.transform(X_test['Essay_compound_ss'].values.reshape(-1,1))
print("="*100)
print("After vectorizations")
print(essay_compound_train.shape, Y_train.shape)
print(essay_compound_test.shape, Y_test.shape)
print("="*100)

from scipy.sparse import hstack
X_train_merge = hstack((categories_one_hot_train, sub_cat_one_hot_train, school_state_one_hot_train, project_grade_cat_one_hot_train, teacher_prefix_cat_one_hot_train, price_data_train, quant_train, prev_no_projects_train,title_bow_train, essay_bow_train)).tocsr()
X_test_merge = hstack((categories_one_hot_test, sub_cat_one_hot_test, school_state_one_hot_test, project_grade_cat_one_hot_test, teacher_prefix_cat_one_hot_test, price_data_test, quant_test, prev_no_projects_test,title_bow_test, essay_bow_test)).tocsr()
print("Final Data matrix")
print("="*100)
print(X_train_merge.shape, Y_train.shape)
print(X_test_merge.shape, Y_test.shape)
print("="*100)
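A minimal sketch of what `hstack(...).tocsr()` does with these blocks (toy shapes, not the real features):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy feature blocks for the same 3 samples: a 3x2 one-hot block
# and a 3x1 numeric column (shapes are illustrative only).
onehot = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]], dtype=float))
price = csr_matrix(np.array([[0.2], [0.9], [0.5]]))

# hstack concatenates the blocks column-wise; .tocsr() converts from
# COO to CSR so the merged matrix supports fast row slicing for training.
merged = hstack((onehot, price)).tocsr()
```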
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the 2nd argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if data has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000,
    # so the loop below covers every full block of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full block
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
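A quick sanity check of the chunking logic in `batch_predict` (toy data and a `LogisticRegression` stand-in, not the project features): two full 1000-row blocks plus a 500-row remainder must reproduce a single `predict_proba` call exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real features: 2500 rows means the loop scores
# two full 1000-row chunks plus a 500-row remainder.
rng = np.random.RandomState(0)
X = rng.randn(2500, 4)
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# Same chunking as batch_predict above.
preds = []
full = X.shape[0] - X.shape[0] % 1000
for i in range(0, full, 1000):
    preds.extend(clf.predict_proba(X[i:i + 1000])[:, 1])
preds.extend(clf.predict_proba(X[full:])[:, 1])

# Batching only bounds peak memory; the scores are identical.
same = np.allclose(preds, clf.predict_proba(X)[:, 1])
```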
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
parameters = {'max_depth':[1, 5, 10, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=10, scoring='roc_auc', return_train_score=True)
clf.fit(X_train_merge, Y_train)
train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']
param_max_depth = clf.cv_results_['param_max_depth']
param_min_samples_split = clf.cv_results_['param_min_samples_split']
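Note that recent scikit-learn versions only populate `mean_train_score`/`std_train_score` in `cv_results_` when `return_train_score=True` is passed; a tiny reproduction on synthetic data (names and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, just to show which keys cv_results_ exposes.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)

# Without return_train_score=True, 'mean_train_score' is missing on
# recent scikit-learn versions and the lookups above raise KeyError.
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': [1, 3]},
                  cv=3, scoring='roc_auc', return_train_score=True)
gs.fit(X, y)
```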
#https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9
df_gridsearch = pd.DataFrame(clf.cv_results_)
#Maximum AUC score on train set VS max_depth, min_samples_split
max_scores = df_gridsearch.groupby(['param_max_depth',
'param_min_samples_split']).max().unstack()[['mean_test_score', 'mean_train_score']]
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on train set VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_train_score, annot=True, fmt='.4g');
plt.title(title);
#Maximum AUC score on CV folds VS max_depth, min_samples_split
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on CV folds VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');
plt.title(title);
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 10, min_samples_split = 500,class_weight='balanced')
clf = dt.fit(X_train_merge, Y_train)
y_train_pred = dt.predict_proba(X_train_merge)[:,1]
y_test_pred = dt.predict_proba(X_test_merge)[:,1]
train_fpr, train_tpr, tr_thresholds = roc_curve(Y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(Y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.grid(True)
plt.show()
a = vectorizer_categories.get_feature_names()
b = vectorizer_sub_cat.get_feature_names()
c = vectorizer_school_state.get_feature_names()
d = vectorizer_project_grade_cat.get_feature_names()
e = vectorizer_teacher_prefix_cat.get_feature_names()
f = vectorizer_title_tfidf.get_feature_names()
g = vectorizer_essay_tfidf.get_feature_names()
from itertools import chain
feature_names_bow = list(chain(
a,
b,
c,
d,
e,
["Price", "Quantity", "Prev_no_projects"],
f,
g))
len(feature_names_bow)
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3)
clf = dt.fit(X_train_merge,Y_train)
import graphviz
from sklearn import tree
from graphviz import Source
dt_data = tree.export_graphviz(dt, out_file=None, feature_names=feature_names_bow)
graph = graphviz.Source(dt_data)
graph.render("Bow tree_set1",view = True)
def predict(proba, threshold, fpr, tpr):
    # pick the threshold that maximises tpr*(1-fpr)
    t = threshold[np.argmax(tpr*(1-fpr))]
    print("the maximum value of tpr*(1-fpr) is", max(tpr*(1-fpr)), "at threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
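The threshold rule in `predict` can be checked on a toy ROC (the scores and labels here are made up): the chosen cut-off should be the point where the curve hugs the top-left corner.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up scores where higher values go with the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Same rule as predict() above: maximise tpr*(1-fpr).
best_t = thresholds[np.argmax(tpr * (1 - fpr))]
labels = (y_score >= best_t).astype(int)
```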
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("="*100)
print("Test confusion matrix")
print(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
print("="*100)
conf_mat_BOW_train = pd.DataFrame(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)), range(2),range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_mat_BOW_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
conf_mat_BOW_test = pd.DataFrame(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_mat_BOW_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
import numpy as np
fp_rows = []
y_train_label = []
# 0.562 is the threshold chosen above from max tpr*(1-fpr)
for i in range(len(y_train_pred)):
    if y_train_pred[i] >= 0.562:
        y_train_label.append(1)
        if Y_train[i] == 0 and y_train_label[i] == 1:
            fp_rows.append(i)
    else:
        y_train_label.append(0)
df_bow = pd.DataFrame(X_train_merge.todense())
df_bow_fp = df_bow.iloc[fp_rows, :]
df_bow_fp.columns = feature_names_bow
# sum each feature column over the false positive rows
fp_freq = df_bow_fp.sum().to_dict()
##https://www.tutorialspoint.com/create-word-cloud-using-python
from wordcloud import WordCloud
wordcloud = WordCloud(width = 1000, height = 500, background_color ='white').generate_from_frequencies(fp_freq)
plt.figure(figsize=(25,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
X_train['price'][fp_rows]
plt.boxplot(X_train['price'][fp_rows])
plt.title('Box plot of price for the false positive data points')
plt.xlabel('Rejected projects')
plt.ylabel('Price')
plt.grid(True)
plt.show()
X_train['teacher_number_of_previously_posted_projects'][fp_rows]
plt.figure(figsize=(15,5))
sns.distplot(X_train['teacher_number_of_previously_posted_projects'][fp_rows], hist=False, label="False Positive points")
plt.title('PDF of teacher_number_of_previously_posted_projects for the false positive data points')
plt.xlabel('Teacher_number_of_previously_posted_projects')
plt.ylabel('Density')
plt.legend()
plt.show()
from scipy.sparse import hstack
X_train_merge = hstack((categories_one_hot_train, sub_cat_one_hot_train, school_state_one_hot_train, project_grade_cat_one_hot_train, teacher_prefix_cat_one_hot_train, price_data_train, quant_train, prev_no_projects_train,title_tfidf_train, essay_tfidf_train)).tocsr()
X_test_merge = hstack((categories_one_hot_test, sub_cat_one_hot_test, school_state_one_hot_test, project_grade_cat_one_hot_test, teacher_prefix_cat_one_hot_test, price_data_test, quant_test, prev_no_projects_test,title_tfidf_test, essay_tfidf_test)).tocsr()
print("Final Data matrix")
print("="*100)
print(X_train_merge.shape, Y_train.shape)
print(X_test_merge.shape, Y_test.shape)
print("="*100)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
parameters = {'max_depth':[1, 5, 10, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=10, scoring='roc_auc', return_train_score=True)
clf.fit(X_train_merge, Y_train)
train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']
param_max_depth = clf.cv_results_['param_max_depth']
param_min_samples_split = clf.cv_results_['param_min_samples_split']
#https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9
df_gridsearch = pd.DataFrame(clf.cv_results_)
#Maximum AUC score on train set VS max_depth, min_samples_split
max_scores = df_gridsearch.groupby(['param_max_depth',
'param_min_samples_split']).max().unstack()[['mean_test_score', 'mean_train_score']]
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on train set VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_train_score, annot=True, fmt='.4g');
plt.title(title);
#Maximum AUC score on CV folds VS max_depth, min_samples_split
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on CV folds VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');
plt.title(title);
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 50, min_samples_split = 500,class_weight='balanced')
clf = dt.fit(X_train_merge, Y_train)
y_train_pred = dt.predict_proba(X_train_merge)[:,1]
y_test_pred = dt.predict_proba(X_test_merge)[:,1]
train_fpr, train_tpr, tr_thresholds = roc_curve(Y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(Y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.grid(True)
plt.show()
a = vectorizer_categories.get_feature_names()
b = vectorizer_sub_cat.get_feature_names()
c = vectorizer_school_state.get_feature_names()
d = vectorizer_project_grade_cat.get_feature_names()
e = vectorizer_teacher_prefix_cat.get_feature_names()
f = vectorizer_title_tfidf.get_feature_names()
g = vectorizer_essay_tfidf.get_feature_names()
from itertools import chain
feature_names_tfidf = list(chain(
a,
b,
c,
d,
e,
["Price", "Quantity", "Prev_no_projects"],
f,
g))
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3)
clf = dt.fit(X_train_merge,Y_train)
import graphviz
from sklearn import tree
from graphviz import Source
dt_data = tree.export_graphviz(dt, out_file=None, feature_names=feature_names_tfidf)
graph = graphviz.Source(dt_data)
graph.render("Tfidf tree_set2", view=True)
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("="*100)
print("Test confusion matrix")
print(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
print("="*100)
# heat map for train data
conf_matr_df_tfidf_train = pd.DataFrame(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#Heat map for test data
conf_matr_df_tfidf_test = pd.DataFrame(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
import numpy as np
fp_rows = []
y_train_label = []
# 0.349 is the threshold chosen above from max tpr*(1-fpr)
for i in range(len(y_train_pred)):
    if y_train_pred[i] >= 0.349:
        y_train_label.append(1)
        if Y_train[i] == 0 and y_train_label[i] == 1:
            fp_rows.append(i)
    else:
        y_train_label.append(0)
df_tfidf = pd.DataFrame(X_train_merge.todense())
df_tfidf_fp = df_tfidf.iloc[fp_rows, :]
df_tfidf_fp.columns = feature_names_tfidf
df_tfidf_fp.head(3)
# sum each feature column over the false positive rows
fp_freq = df_tfidf_fp.sum().to_dict()
from wordcloud import WordCloud
##https://www.tutorialspoint.com/create-word-cloud-using-python
wordcloud = WordCloud(width = 1000, height = 500, background_color ='white').generate_from_frequencies(fp_freq)
plt.figure(figsize=(25,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
X_train['price'][fp_rows]
plt.boxplot(X_train['price'][fp_rows])
plt.title('Box plot of price for the false positive data points')
plt.xlabel('Rejected projects')
plt.ylabel('Price')
plt.grid(True)
plt.show()
X_train['teacher_number_of_previously_posted_projects'][fp_rows]
plt.figure(figsize=(15,5))
sns.distplot(X_train['teacher_number_of_previously_posted_projects'][fp_rows], hist=False, label="False Positive points")
plt.title('PDF of teacher_number_of_previously_posted_projects for the false positive data points')
plt.xlabel('Teacher_number_of_previously_posted_projects')
plt.ylabel('Density')
plt.legend()
plt.show()
from scipy.sparse import hstack
X_train_merge = hstack((categories_one_hot_train, sub_cat_one_hot_train, school_state_one_hot_train, project_grade_cat_one_hot_train, teacher_prefix_cat_one_hot_train, price_data_train, quant_train, prev_no_projects_train,title_avg_w2v_train, essay_avg_w2v_train)).tocsr()
X_test_merge = hstack((categories_one_hot_test, sub_cat_one_hot_test, school_state_one_hot_test, project_grade_cat_one_hot_test, teacher_prefix_cat_one_hot_test, price_data_test, quant_test, prev_no_projects_test,title_avg_w2v_test, essay_avg_w2v_test)).tocsr()
print("Final Data matrix")
print("="*100)
print(X_train_merge.shape, Y_train.shape)
print(X_test_merge.shape, Y_test.shape)
print("="*100)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
parameters = {'max_depth':[1, 5, 10, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=10, scoring='roc_auc', return_train_score=True)
clf.fit(X_train_merge, Y_train)
train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']
param_max_depth = clf.cv_results_['param_max_depth']
param_min_samples_split = clf.cv_results_['param_min_samples_split']
#https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9
df_gridsearch = pd.DataFrame(clf.cv_results_)
#Maximum AUC score on train set VS max_depth, min_samples_split
max_scores = df_gridsearch.groupby(['param_max_depth',
'param_min_samples_split']).max().unstack()[['mean_test_score', 'mean_train_score']]
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on train set VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_train_score, annot=True, fmt='.4g');
plt.title(title);
#Maximum AUC score on CV folds VS max_depth, min_samples_split
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on CV folds VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');
plt.title(title);
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 5, min_samples_split = 10,class_weight='balanced')
clf = dt.fit(X_train_merge, Y_train)
y_train_pred = dt.predict_proba(X_train_merge)[:,1]
y_test_pred = dt.predict_proba(X_test_merge)[:,1]
train_fpr, train_tpr, tr_thresholds = roc_curve(Y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(Y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.grid(True)
plt.show()
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("="*100)
print("Test confusion matrix")
print(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
print("="*100)
# heat map for train data
conf_matr_df_tfidf_train = pd.DataFrame(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#Heat map for test data
conf_matr_df_tfidf_test = pd.DataFrame(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
from scipy.sparse import hstack
X_train_merge = hstack((categories_one_hot_train, sub_cat_one_hot_train, school_state_one_hot_train, project_grade_cat_one_hot_train, teacher_prefix_cat_one_hot_train, price_data_train, quant_train, prev_no_projects_train,title_tfidf_w2v_train, essay_tfidf_w2v_train)).tocsr()
X_test_merge = hstack((categories_one_hot_test, sub_cat_one_hot_test, school_state_one_hot_test, project_grade_cat_one_hot_test, teacher_prefix_cat_one_hot_test, price_data_test, quant_test, prev_no_projects_test,title_tfidf_w2v_test, essay_tfidf_w2v_test)).tocsr()
print("Final Data matrix")
print("="*100)
print(X_train_merge.shape, Y_train.shape)
print(X_test_merge.shape, Y_test.shape)
print("="*100)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
parameters = {'max_depth':[1, 5, 10, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=10, scoring='roc_auc', return_train_score=True)
clf.fit(X_train_merge, Y_train)
train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']
param_max_depth = clf.cv_results_['param_max_depth']
param_min_samples_split = clf.cv_results_['param_min_samples_split']
#https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9
df_gridsearch = pd.DataFrame(clf.cv_results_)
#Maximum AUC score on train set VS max_depth, min_samples_split
max_scores = df_gridsearch.groupby(['param_max_depth',
'param_min_samples_split']).max().unstack()[['mean_test_score', 'mean_train_score']]
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on train set VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_train_score, annot=True, fmt='.4g');
plt.title(title);
#Maximum AUC score on CV folds VS max_depth, min_samples_split
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on CV folds VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');
plt.title(title);
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 5, min_samples_split = 5,class_weight='balanced')
clf = dt.fit(X_train_merge, Y_train)
y_train_pred = dt.predict_proba(X_train_merge)[:,1]
y_test_pred = dt.predict_proba(X_test_merge)[:,1]
train_fpr, train_tpr, tr_thresholds = roc_curve(Y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(Y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.grid(True)
plt.show()
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("="*100)
print("Test confusion matrix")
print(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
print("="*100)
# heat map for train data
conf_matr_df_tfidf_train = pd.DataFrame(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#Heat map for test data
conf_matr_df_tfidf_test = pd.DataFrame(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_matr_df_tfidf_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
from scipy.sparse import hstack
X_train_merge = hstack((categories_one_hot_train, sub_cat_one_hot_train, school_state_one_hot_train, project_grade_cat_one_hot_train, teacher_prefix_cat_one_hot_train, price_data_train, quant_train, prev_no_projects_train,title_tfidf_train, essay_tfidf_train)).tocsr()
X_test_merge = hstack((categories_one_hot_test, sub_cat_one_hot_test, school_state_one_hot_test, project_grade_cat_one_hot_test, teacher_prefix_cat_one_hot_test, price_data_test, quant_test, prev_no_projects_test,title_tfidf_test, essay_tfidf_test)).tocsr()
print("Final Data matrix")
print("="*100)
print(X_train_merge.shape, Y_train.shape)
print(X_test_merge.shape, Y_test.shape)
print("="*100)
#https://scikit-learn.org/stable/modules/feature_selection.html
from sklearn.ensemble import ExtraTreesClassifier
import pandas as pd
clf = ExtraTreesClassifier()
df_tfidf_5k = pd.DataFrame(X_train_merge.todense())
df_tfidf_5k.columns = feature_names_tfidf
clf = clf.fit(df_tfidf_5k,Y_train)
# https://datascience.stackexchange.com/questions/31406/tree-decisiontree-feature-importances-numbers-correspond-to-how-features
tfidf_5k_fimpt = dict(zip(feature_names_tfidf, clf.feature_importances_))
#https://stackoverflow.com/questions/16772071/sort-dict-by-value-python
tfidf_5k_fimpt = sorted(tfidf_5k_fimpt.items(), key=lambda x: x[1], reverse=True)
tfidf_5k_fimpt = tfidf_5k_fimpt[:5000]
#https://stackoverflow.com/questions/22412258/get-the-first-element-of-each-tuple-in-a-list-in-python
tfidf_5k_fimpt = [seq[0] for seq in tfidf_5k_fimpt if seq[1] > 0.0]  # keep only features with non-zero importance
df_tfidf_5k = df_tfidf_5k[tfidf_5k_fimpt]
df_5k_test = pd.DataFrame(X_test_merge.todense(),columns = feature_names_tfidf)
df_5k_test = df_5k_test[tfidf_5k_fimpt]
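As a sanity check of the importance-based selection above (a toy frame with hypothetical column names): a feature that determines the label should dominate `feature_importances_` and survive the `> 0.0` filter.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Toy frame: 'signal' determines the label, 'noise' is independent
# random data, so the forest should rank 'signal' first.
rng = np.random.RandomState(0)
signal = rng.randn(400)
df = pd.DataFrame({'signal': signal, 'noise': rng.randn(400)})
y = (signal > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(df, y)
# Rank features by importance and keep those with non-zero weight,
# mirroring the selection above.
ranked = sorted(zip(df.columns, forest.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
kept = [name for name, imp in ranked if imp > 0.0]
```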
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
parameters = {'max_depth':[1, 5, 10, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=10, scoring='roc_auc', return_train_score=True)
clf.fit(df_tfidf_5k, Y_train)
train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']
param_max_depth = clf.cv_results_['param_max_depth']
param_min_samples_split = clf.cv_results_['param_min_samples_split']
#https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9
df_gridsearch = pd.DataFrame(clf.cv_results_)
#Maximum AUC score on train set VS max_depth, min_samples_split
max_scores = df_gridsearch.groupby(['param_max_depth',
'param_min_samples_split']).max().unstack()[['mean_test_score', 'mean_train_score']]
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on train set VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_train_score, annot=True, fmt='.4g');
plt.title(title);
#Maximum AUC score on CV folds VS max_depth, min_samples_split
plt.rcParams["figure.figsize"] = (10, 7)
title = 'Maximum AUC score on CV folds VS max_depth, min_samples_split'
sns.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');
plt.title(title);
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 50, min_samples_split = 500,class_weight='balanced')
clf = dt.fit(df_tfidf_5k, Y_train)
y_train_pred = dt.predict_proba(df_tfidf_5k)[:,1]
y_test_pred = dt.predict_proba(df_5k_test)[:,1]
train_fpr, train_tpr, tr_thresholds = roc_curve(Y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(Y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.grid(True)
plt.show()
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("="*100)
print("Test confusion matrix")
print(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
print("="*100)
conf_mat_BOW_train = pd.DataFrame(confusion_matrix(Y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)), range(2),range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_mat_BOW_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
conf_mat_BOW_test = pd.DataFrame(confusion_matrix(Y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)), range(2), range(2))
sns.set(font_scale=1.4)
sns.heatmap(conf_mat_BOW_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
from prettytable import PrettyTable
x_pretty_table = PrettyTable()
x_pretty_table.field_names = ["Model Type","Vectorizer","max_depth","min_samples_split","Train-AUC","Test-AUC"]
x_pretty_table.add_row(["Decision Tree","BOW",10,500,0.62,0.60])
x_pretty_table.add_row([ "Decision Tree","TFIDF",50,500,0.78, 0.56])
x_pretty_table.add_row([ "Decision Tree","AVG W2V",5,10,0.68,0.58])
x_pretty_table.add_row([ "Decision Tree","TFIDF W2V",5,5,0.57,0.56])
x_pretty_table.add_row([ "Decision Tree :Top 5k Features","TFIDF",50,500,0.78,0.56])
print(x_pretty_table)